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Abstract 

Joint sparsity offers powerful structural cues for feature selection, especially for variables that are expected 
to demonstrate a "grouped" behavior. Such behavior is commonly modeled via group-lasso, multitask lasso, 
and related methods where feature selection is effected via mixed-norms. Several mixed-norm based sparse 
models have received substantial attention, and for some cases efficient algorithms are also available. Surpris- 
ingly, several constrained sparse models seem to be lacking scalable algorithms. We address this deficiency by 
presenting batch and online (stochastic-gradient) optimization methods, both of which rely on efficient projec- 
tions onto mixed-norm balls. We illustrate our methods by applying them to the multitask lasso. We conclude 
by mentioning some open problems. 

Keywords: Mixed-norm, Group sparsity, Fast projection, multitask learning, matrix norms, stochas- 
tic gradient 



1 Introduction 

Sparsity encodes key structural information about data and permits estimating unknown, high-dimensional vec- 
tors robustly. No wonder, sparsity has been intensively studied in signal processing, machine learning, and 
statistics, and widely applied to many tasks therein. But the associated literature has grown too large to be 
summarized here; so we refer the reader to [2. 28. 35. 36 as starting points. 

Sparsity constrained problems are often cast as instances of the following high-level optimization problem 

mm xeRd L(x) + Xf(x), (1.1) 

where L is a differentiable loss-function, / is a convex (nonsmooth) regularizer, and A > is a scalar. Alterna- 
tively, one may prefer the constrained formulation 

mm xeRd L(x) s.t. f(x) < 7. (1.2) 

Both formulations ( ll.lt and jl.2\ continue to be actively researched, the former perhaps more than the latter. 
We focus on the latter, primarily because it often admits simple but effective first-order optimization algorithms. 
Additional benefits that make this constrained formulation attractive include: 

• Even when the loss L is nonconvex, gradient-projection remains applicable; 

• If the loss is separable, it is easy to derive highly scalable incremental or stochastic -gradient based opti- 
mization algorithms; 

* Preprint of a paper under review 
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• If only inexact projections onto f(x) < 7 are possible (a realistic case), convergence analysis of gradient- 
projection-type methods remains relatively simple. 

In this paper, we study a particular subclass of (11.2) that has recently become important, namely, groupwise 
sparse regression. Two leading examples are multitask learning [15. 16. 24. 30 1 and group-lasso [3,44,43. A 
key component of these regression problems is the regularizer f(x), which is designed to enforce 'groupwise 
variable selection' — for example, with f(x) chosen to be a mixed-norm. 

Definition 1 (Mixed-norm). Let 1 e l d be partitioned into subvectors x 1 G E*, for i G [mjj. The £ p _ q -mixed- 
norm for p, q > 1, is then defined as 

f(x) = \\x\\ p<q :=(J2Zj x %) 1/P - d-3) 

The most practical instances of ( 11 .3) are £i, g -norms, especially for q G {1, 2, 00}. The choice q = 1 yields 
the ordinary ^i-norm penalty; q = 2 is used in group-lasso [45 1, while q = 00 arises in compressed sensing [43 1 
and multitask lasso [24|. Less common, though potentially useful versions allow interpolating between these 
extremes by letting q G (1, 00); see also 122, 34,461. 

Definition Q] can be substantially generalized: we may allow the subvectors x % to overlap; or to even be 
normed differently [47 1 . But unless the overlapping has special structure [19, 27, 28], it leads to somewhat 
impractical mixed-norms, as the corresponding optimization problem (11. 2b becomes much harder. Since our 
chief aim is to develop fast, scalable algorithms for (11.2) , we limit our discussion to t\ g -norms — this choice is 
widely applicable, hence important El[ll[l5l[l7ll2Qll31|3Ul|40llH . 

Before moving onto the technical part, we briefly list the paper's main contents]! 

• Batch and online (stochastic-gradient based) algorithms for solving d 1 .2b ; 

• Theory of and algorithms for fast projection onto l\ i9 -norm balls; 

• Application to £i g -norm based multitask lasso; both batch and online versions; 

• Application to computing projections for matrix mixed-norms; 

• A set of open problems. 

2 Basic theory 

We begin by developing some basic theory. Our aim is to efficiently implement a generic 'first-order' algorithm: 
Generate a sequence {xt} by iterating 

x t+ i =proj / (ir i -TjtVt), t = 0,l,..., (2.1) 

where rj > is a stepsize, Vt is an estimate of the gradient, and proj * is the projection operator that enforces the 
constraint f(x) < 7. Below we expand on the most challenging component of iteration d2.1| ) when applied to 
mixed-norm regression, namely efficient computation of the projection operator proj j . 

2.1 Efficient projection via proximity 

Formally, the (orthogonal) projection operator projy : R d — > M. d is defined as 

proj/Cy) := argmux, ^ ||^ — 2/||I s.t. f(x) < 7. (2.2) 
Closely tied to projection is the proximity operator prox ;i : R d x R + — > K d 

prox h (y,d) := axgrnia,, - y\\l + 6h{x), (2.3) 

'We use [m] as a shorthand for the set {1,2,..., m}. 

2 Which also helps position this paper relative to its precursor at ECML 201 1 1411 . 
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where h is a convex function on K d . Operator (12.3b generalizes projections: if in (12. 3) the function /i is chosen to 
be the indicator function for the set {x : f(x) < 7}, then the operator prox^ reduces to the projection operator 
proj/. 

Alternatively, for convex / and h, operators proj * and prox^ are also intimately connected by duality. Indeed, 
this connection proves key to computing a projection efficiently whenever its corresponding proximity operator 
is 'easier'. The idea is simple (see e.g., [31 1), but exploiting it effectively requires some care; let us see how. 

Let C(x,9) be the Lagrangian for (12.2) ; and let the optimal dual solution be denoted by 9*. Assuming 
strong-duality, the optimal primal solution is given by 

x(9*) :=argmin x C(x,6*) := argmin x § \\x - y\\ 2 2 + 0* (f(x) - 7). (2.4) 

But to compute ( 12.41 ), we require the optimal 9* — the key insight on obtaining 9* is that it can be computed by 
solving a single nonlinear equation. Here is how. 

First, observe that if f(y) < 7, then x(9*) = y, and there is nothing to compute. Thus, assume that f(y) > 7; 
then, the optimal point x (9*) satisfies 

=7- (2-5) 

Next, observe from (12.4b that for a fixed 9, the point x(9) equals the operator prox^(y, 9). Consider, therefore, 
the nonlinear function (residual) 

g(9) := f(x(8)) - 7 = /( P ro X/ (y, 9)) - 7, (2.6) 

which measures how accurately equation (12.5b is satisfied. The optimal 8* can be then obtained by solving 
g(9) = 0, for which the following lemma proves very useful. 

Lemma 2. Let f(x) be a gaugJ^ and let g(9) be as defined in ( 12.6b . Then, there exists an interval [0, f? max ], on 
which g(9) is monotonically decreasing, and differs in sign at the endpoints. 

Proof. By assumption on f(y), it holds that g(0) = f(y) — 7 > 0. We claim that for 9 > f°(y), where f° 
denotes the polar of /, the optimal point x(9) = 0. To see why, suppose that 9 > f°(y), but x(9) ^ 0. Then, 
h\\x{0) -y\\l + 6f{x{0)) < But since || -||| is strictly convex, the inequality ||y||| - ||a; - y\\\ < 2{y, x) 

also holds for any x. Thus, it follows that 9 < (y, x(9)) / f(x(9)), whereby, for 9 > sup x ^ (y, x)/ f(x) = 
f°(y), the optimal x(9) must equal 0. Hence, we may select max = ,f°(y)- Monotonicity of g follows easily, 
as it is the derivative of the concave (dual) function inf^ C(x, 9). Finally, g(8 max ) = —7 < 0, so it differs in 
sign. □ □ 

Since g(9) is continuous, changes sign, and is monotonic in the interval [0, m ax], it has a unique root therein. 
This root can be computed to e-accuracy using bisection in O(log(0 max /e)) iterations. We recommend not to 
use mere bisection, but rather to invoke a more powerful root -finder that combines bisection, inverse quadratic 
interpolation, and the secant method (e.g., Matlab's f zero function). Pseudocode encapsulating these ideas 
is given in Algorithm Q] 

2.1.1 Projection onto l\ 9 -norm balls 

After the generic approach above, let us specialize to projections for the case of central interest to us, namely, 
projj with fix) — £i tq (x). Algorithm Q] requires computing the upper bound 9 max = f°(y). To that end 
Lemma|3] which actually proves much more, proves useful. 

Lemma 3 (Dual-norm). Let p,q > 1; andletp*,q* > 1 be "conjugate" scalars, i.e., l/p+ 1/p* = 1 and 
l/q+ 1/q* = 1. The polar (dual-norm) of \\-\\p. q is \\-\\p*,q*. 

3 That is, / is nonnegative, positively homogeneous, and disappears at the origin |37 §15] 
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Input: Subroutine to compute proxj(y, 0); vector y; scalar 7 > 
Output: x* := pmj f (y,j) 

if f{y) < 7 then 
I return x* = y 
else 

Define g(9) := proxy(y, ff) — 7; 
Compute interval [0 min , max ] = [0,/°(y)]; 
Compute root 6* = FiNDRoOT(g(6>), min , 6> max ); 

return x* = proxj(y, 0*) 

Algorithm 1: Root-finding for projection via proximity 



Proof. By definition, the norm dual to an arbitrary norm || • || is given by 

||u||, :=sup{(x, u) I < 1}. (2.7) 

To prove the lemma, we prove two items: (i) for any two (conformally partitioned) vectors x and u, we have 
|(x, u)\ < \\%\\p,q\\u\\p*q*; and (ii) for each u, there exists an x for which (x, u) = \\y\\ p *,q*. 
Let x be a vector partitioned conformally to u, and consider the inequality 

(x,u)=Y, 9 , II^IUKIIs** (2-8) 

* — ^z— 1 * — ^l— 1 

which follows from Holder's inequality. Define ^ = and £ = and invoke Holder's inequality 

again to obtain (t/j, £) < HV'IIpIICIIp* — IMIp,<jIMIp*.<3* ■ Thus, from definition ( I2.71 i we conclude that < 
Il w llp*.q*- To prove that the dual norm actually equals ||it||p*,g», we show that for each u, we can find an x that 
satisfies ||x|| p .g = 1, for which the inner-product (x, u) = \\u\\ p * q *. 

Define therefore /3 = J^i \\ u% \\q' — some juggling with indices suggests that we should set 

1 WvtwK 

- ' 'i^-sgnXOIuil'- 1 , (2.9) 



where denotes the j-the element of the subvector x % (similarly 11*). To see that j2.9i works, first consider the 
inner-product 



x-u) 
'i J J 



(x, «>=£<<*<, «*>=£Xj 

^£> < ll£~ 4 *£>$r (-nee sgn(^)u} 



pi/p 



i/p* 



= (£jhi?) 1/p = NIp*,«*- 

Next, we check that \\x\\ p , q = ||^||p) 1/p = 1. Consider thus, the term ||ar f ||^ = (V \x)\ q ) p/q . Using d^9b 
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we have 
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(since (?((?*) _1 + 1 = q). 



Thus, it follows that 

11**11? = (E, l^l 9 ) P/9 = ^IKII^*^ = ^ll^ll9* P *> (2-10) 
where the last equality holds because 1/p + 1/p* = 1. Finally, from ( 12. 1 Ob it follows that 

Ni M = E i H a!< ii?) 1/ ' = (iE 4 ii tt< iif) 1/, ' = 1 . ^ 

since by definition ft = J^i ■ This concludes the proof. □ 

The next key component for Algorithm[T]is the proximity operator prox^. For f(x) = \\x\\ this operator 
requires solving 

mm YTih^-v'^ + OYrM^- (2 ' 12) 

Fortunately, Problem (12.12b separates into a sum of m independent, £ 9 -norm proximity operators. It suffices, 
therefore, to only consider a subproblem of the form 

min n \\\u-v\\l + 9\\u\\ q . (2.13) 

For 9=1, the solution to ( 12.13b is given by the soft-thresholding operation fl3l : 

u{9) = sgn(v) 0max(|u| -6,0), (2.14) 

where operator performs elementwise multiplication. For q = 2, we get 

u(9) = max(l - 9\\v\\2 1 , 0)v, (2.15) 

while the case q = oo is slightly more involved. It can be solved via the Moreau decomposition [ 1 1 1, which, for 
a norm / = || -|| implies that 

proxj(u,0) = v - projyo (v, 9). (2.16) 

For / = || - 1| oo, the dual-norm (polar) is f° — ||-||i; but projection onto £i-balls has been extremely well- 
studied— see e.g., Il2"ni25ll29l. 

For q > 1 (different from 2 and oo), problem ( 12.13b is much harder. Fortunately, this problem was recently 
solved in ||26l , using nested root-finding subroutines. But unlike the cases q £ {1, 2, oo}, the proximity operator 
for general q can be computed only approximately (i.e., in ( 12.6b , each iteration generates only approximate x(9)). 



2.1.2 Mixed norms for matrices: a brief digression 

We now make a brief digression, which is afforded to us by the above results. Our digression concerns mixed- 
norms for matrices, as well as their associated projection, proximity operators, which ultimately depend on the 
results of the previous section. 
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Our discussion is motivated by applications in 11421 , where the authors used mixed-norms on matrices to 
simultaneously. We define mixed-norms on matrices by building upon the classic Schatten-q matrix norms Q, 
defined as: 

IWUHE, ?^)) 17 '. forq>l, (2.17) 

where X is an arbitrary complex matrix, and <Ji(X) is its ith singular value. Now, let X = {X 1 , . . . , X m } be an 
arbitrary set of matrices, and let p,q > 1. We define the matrix (p, q)-norm by the formula 

nxii(M) : = (EZx " xi ii") 1/p (2 - 18) 

As for the vector case, we have a similar lemma about norms dual to ( 12.181 ). 

Lemma 4 (Matrix Holder inequality). Let X and Y be matrices such that tt{X*Y) is well-defined. Then, for 
p > 1, such that 1/p + 1/p* = 1, it holds that 

\(X,Y)\ = \ti(X*Y)\<\\X\\ p \\Y\\ p ,. (2.19) 

Proof. From the well-known von Neumann trace inequality |[T8l §3.3] we know that 

|tr(X*Y)| < y2ai(X)ai(Y) = (a(X), a(Y)). 

Now invoke the classical Holder inequality and use definition (fZTTJ of matrix mixed-norms to conclude. □ 

Lemma 5 (Dual norms). Let p,q > 1; and let p* , q* be their conjugate exponents. The norm dual to is 

IHI(p%?*)- 

Proof. By the triangle-inequality and Lemma|4]we have 

|(X, Y)| = fc.iX*, F'>| < U\ Y l )\ < \\X%\\Y%.. 
Applying Holder's inequality to the latter term we obtain 

Ejl^lWl^iU* ^ l|X|l( M )ll Y ll( P *, g *)- (2-20) 

Now, we must show that for any Y, we can find an X such that (12.201 i holds with equality. To that end, let 
Y l = P l S l Q* betheSVD of matrix Y\ Setting X 1 = P l Y ll Q*, we see that |(X, Y)| = £ 2 since both 

E» and Si are diagonal, this reduces to the vector case (12.91 . completing the proof. □ 

Projections onto | | (i g )-norm balls: 

As for vectors, we now consider the matrix (1 , g)-norm projection 

nr.n A \ E"\ lU l -Y l \\l s.t ,11**11* < 7- (2-21) 

Algorithm[T]can be used to solve (12.211 >. The upper bound max can be obtained via Lemma|5] It only remains 
to solve proximity subproblems of the form 

mm x ±\\X-Y\\$ + e\\X\\ q . (2.22) 

Since both ||-[[p and ||-||, are unitarily invariant, from Corollary 2.5 of lETl it follows that if Y l has the singular 
value decomposition Y = U Dg(y)V*, then ( I2.221 i is solved by X = U T>g(x)V*, where the vector x is obtained 
by solving 

x :=proX||.|| 5 (j/) := argmia,. ±\\x - y\\% + 0\\x\\ q . 

We note in passing that operator ( 12.22) generalizes the popular singular value thresholding operator [Hill , which 
coiTesponds to q = 1 (trace norm). 
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3 Algorithms for solving (|1.2|) 



We describe two realizations of the generic iteration (I2.lt that can be particularly effective: (i) spectral projected 
gradients; and (ii) stochastic-gradient descent. 



3.1 Batch method: spectral projected gradient 

The simplest method to solve ( 1 1.21 ) is perhaps gradient-projection [38), where starting with a suitable initial point 
xo, one iterates 

x t+ i =proj / (a; t -rj t VL(x t )), t = Q, 1, .... (3.1) 

We have already discussed proj^; the other two important parts of (13.1) are the stepsize r) t , and the gradient VL. 
Even when the loss L is not convex, under fairly mild condition, we may still iterate J3.ll ) to obtain convergence 
to a stationary point — see B Chapter 1] for a detailed discussion, including various strategies for computing 
stepsizes. If, however, L is convex, we may invoke a method that typically converges much faster: spectral 
projected gradient (SPG) [ 8 ] . 

SPG extends ordinary gradient-projection by using the famous (nonmonotonic) spectral stepsizes of Barzilai 
and Borwein |4| (BB). Formally, these stepsizes are 

(Ax u Ax t ) (Ax t , Agt) 

Vbbi ■= jt t — r, or r)BB2 := jt t— r, (3.2) 

(Ag t , Ax t ) (Ag t , Ag t ) 

where Ax t — x t — x t -i, and Ag t = VL(x t ) — VL(x t _i). 

SPG substitutes stepsizes ( 13.2b in ( 13.11 ) (using safeguards to ensure bounded steps). Thereby, it leverages the 
strong empirical performance enjoyed by BB stepsizes [4. 8. 1211391 : to ensure global convergence, SPG invokes a 
nonmontone line search strategy that allows the objective value to occasionally increase, while maintaining some 
information that allows extraction of a descending subsequence. 



Inexact projections: Theoretically, the convergence analysis of SPG [8| depends on access to a subroutine 
that computes proj^ exactly. Obviously, in general, this operator cannot be computed exactly (including for 
many of the mixed-norms). To be correct, we must rely on an inexact SPG method such as |9|. In fact, due to 
roundoff error, even the so-called exact methods run inexactly. So, to be fully correct, we must treat the entire 
iteration (13. lb as being inexact. Such analysis can be done (see e.g., [32]); but it is not one of the main aims of 
this paper, so we omit it. 

3.2 Stochastic-gradient method 

Suppose the loss-function L in ( 11.2b is separable, that is, 

L(x) = V r i t {x), where x G R d , (3.3) 

^ — ^i— 1 

for some large number r of components (say r ^> d). In such a case, computing the entire gradient VL at each 
iteration ( 13.1b may be too expensive, and it might be more preferable to use stochastic-gradient descent (SGDf| 
instead. In its simplest realization, at iteration t, SGD picks a random index s(t) G [r], and replaces VL(x) by a 
stochastic estimate V£ s ( t )( x )- This results in the iteration 

x t+ i=pToj f (x t -r] t V£ s (t)(xt)), £ = 0,1,..., (3.4) 

where r\ t are suitable (e.g., r\ t oc 1/t) stepsizes. Again, some additional analysis is also needed for ( 13.4b to 
account for the potential inexactness of the pr ojections. 

4 This popular name is a misnomer because SGD does not necessarily lead to descent at each step. 
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Table 1: Runtime and accuracy for QP and FP on a 10, 000 x 300 matrix V . 

4 Experimental results and applications 

We present below numerical results that illustrate the computational performance of our methods. In particular, 
we show the following main experiments: 

1 . Running time behavior of our root-finding projection methods, including 

• Comparisons against the method of 1 33 1 for £i jOC projections 

• Some results on £i jQ projections for a few different values of q. 

2. Application to the £i !(x) -norm multitask lasso [24 1, for which we show 

• Running time behavior of SPG, both with our projection and that of l33l : 

• Derivation of and numerical results with a SGD based method for MTL. 

4.1 Projection onto the l\ oo-ball 

For ease of comparison, we use the notation of [33 1, who seem to be the first to consider efficient projections 
onto the £i j00 -norm ball. The task is to solve 

minw \\\W~V\\l s.t. V d IKH^ < 7 , (4.1) 

where W is a d x n matrix, and w l denotes its ith row. 

In our comparisons, we refer to the algorithm of ll33l (C imple mentation J3, as 'QP'@ and to our method 
as 'FP' (also C implementation). The experiments were run on a single core of a quad-core AMD Opteron 
(2.6GHz), 64bit Linux machine with 16GB RAM. 

We compute the optimal W* , as 7 varies from 0.01|| V||i j00 (more sparse) to 0.6||F||i iOO (less sparse) settings. 
Tables QJOpresent running times, objective function values, and errors (as measured by the constraint violation: 
I7 — ||W*||i )00 |, for an estimated W*). The tables also show the absolute difference in objective value between 
QP and FP. While for small problems, QP is very competitive, for larger ones, FP consistently outperforms it. 
Although on average FP is only about twice as fast as QP, it is noteworthy that despite FP being an "inexact" 
method (and QP an "exact" one), FP obtains solutions of accuracy many magnitudes of order better than QP. 

4.2 Projection onto 

Next we show running time behavior displayed our method for projecting onto £\. q balls; we show results for 
q G {1.5, 2.5, 3, 5}, when solving 

min w \\W-V\\l s.t. V d \\w l \\ q . (4.2) 

^ — ^i— 1 

5 http.V/www. hi. upc. edu/^aquattoni/CodeToShare/ 

6 The runtimes for QP reported in this paper differ significantly from those in our previous paper 1411 . This difference is due to an 
unfortunate bug in the previous implementation of [33 ] , which got uncovered after the authors of 1 33 saw our experimental results in 1411 . 
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Table 2: Runtime and accuracy for QP and FP on a 50, 000 x 1000 matrix V. 
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Table 3: Runtime and accuracy for QP and FP on a 50, 000 x 10, 000 matrix V. For this experiment, QP did not run on our 
machine with 16GB, so we performed this experiment on a machine with 32GB RAM. 



The plots (Figure [TJ also running time behavior as the parameter 7 is varied. These plots reveal four main 
points: (i) the runtimes seem to be largely independent of 7; (ii) for smaller values of q, the projection times are 
approximately same; and (iii) for larger values of, the projection times increase dramatically. 

Moreover, from the actual running times it is apparent our projection code scales linearly with the data size. 
For example, the matrix corresponding to the second bar plot has 25 times more parameters than the first plot, 
and the runtimes reported in the second plot are approximately 25-30 times higher. Although the running times 
scale linearly, a single £i ig -norm projection still takes nontrivial effort. Thus, even though our £i i9 -projection 
method is relatively fast, currently we can recommend it only for small and medium-scale regression problems. 





Figure 1: Running times for ^i^-norm projections as scalars q and ratios 7/H V||i, q vary. The left plot is on a 
1000 x 100 matrix, while the right one is on a 5000 x 500 matrix. 



4.3 Application to Multitask Lasso 

Multitask Lasso (Mtl) [24 44] is a simple grouped feature selection problem, which separates important features 
from less important ones by using information shared across multiple tasks. The feature selection is effected by 
a sparsity promoting mixed-norm, usually the ^i j00 -norm 11241 . 
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Input: Scalar 7 > 0; batchsize b; stepsize sequence: r/o, r/i, . . . 
Output: W* « argmin^ L(W), s.t. ||VK T ||i,oo < 7 
Wo <- 0; 

while -1 converged do 

Pick b different indices in [mn] ; 
Obtain stochastic gradient using ( 14.5b : 

W t «- proj(W t _! - 7? t V4(W t )); 
_ i t + 1; 

return W* 

Algorithm 2: MTL via stochastic-gradient descent 



Formally, Mtl is setup as follows. Let Xj £ R mj xd be the data matrix for task j, where 1 < j < n. Mtl 
seeks a matrix W £ R dx ", each column of which corresponds to parameters for a task; these parameters are 
regularized across features by applying a mixed-norm over the rows of W. This leads to a "grouped" feature 
selection, because if for a row, the norm ||k/||oo = 0, then the entire row w l gets eliminated (i.e., feature i is 
removed). The standard Mtl optimization problem is 

min C{W) := \\\y 3 - X jWj \\l s.t. ^ |K|U < 7, (4.3) 

where the yj are the dependent variables, and 7 > is a sparsity-tuning parameter. Notice that the loss-function 
combines the different tasks (over columns of W), but the overall problem does not decompose into separable 
problems because the mixed-norm constrained is over the rows of W. 

4.3.1 Stochastic-gradient based MTL 

We may rewrite the MTL problem as 

min L(W) |||% - X jWj g = |||y- Xw\\l 

d ( 4 - 4 > 

s-t. V. , IKHoo < 7, 

where we have introduced the notation 

y = vcc(Y), X = Xi®---® X n , and w = vec(W), 

in which vec ( ■ ) is the operator that stacks columns of its argument to yield a long vector, and © denotes the direct 
sum of two matrices. Notice that if it were not for the £i i00 -norm constraint, problem (14.41 would just reduce to 
ordinary least squares. 

The form (14.41 i. however, makes it apparent how to derive a stochastic-gradient method. In particular, suppose 
that we use a "mini-batch" of size b, i.e., we choose b rows of matrix X, say X\,. Let yt denote the corresponding 
rows (components) of y. This subset of rows contributes tb(w) := 5II2/& — Xi,w ||| to the objective ( 14.41 ), whereby 
we have the stochastic-gradient 

V4H = Xl(X b w - y b ). (4.5) 
Then, upon instantiating iteration (13.4b with (14.51 l. we obtain Algorithmic] 

Implementation notes: Despite our careful implementation, for large-scale problems the projection can be- 
come the bottleneck in Algorithm [2] To counter this, we should perform projections only occasionally — the 
convergence analysis is unimpeded, as we may restrict our attention to the subsequence of iterates for which 
projection was performed. Other implementation choices such as size of the mini-batch and the values of the 
stepsizes r/t are best determined empirically. Although tuning r\t can be difficult, this drawback is offset by the 
gain in scalability. 
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4.3.2 Simulation results 



We illustrate running time results of SPG on two large-scale instances of Mtl (see Table |4j . We report running 
time comparisons between two different invocations of an SPG-based method for solving (14.31 >. once with QP 
as the projection method and once with FP — we call the corresponding solvers SPGqp, and SPGfp. We note 
in passing that other efficient Mtl algorithms (e.g., ll20l 261) solve the penalized version; our formulation is 
constrained, so we only show SPG. 



Name 


(m, d, n) 


#nonzeros 


Dl 


(IK, 5K, 10K) 


50 million 


D2 


(10K, 50K, IK) 


500 million 



Table 4: Sparse datasets used for MTL. For simplicity, all matrices Xj (for each task 1 < j < n), were chosen to have the 
same size m x d. 



Dataset 


#projs 


proj QP 


proj F p 


SPGqp 


SPGfp 


Dl 


50 


2275.2s 


1204.3s 


2722.9s 


1728.3s 


D2 


48 


2631.8s 


1362.3s 


3495.1s 


2296.7s 



Table 5: Running times (seconds) on datasets Dl and D2. SPG was used to solve MTL, with stopping tolerance of 10~ . 
Total number of projections required to reach this accuracy are reported as '#projs\ The columns 'projop' and 'projFp', report 
the total time spent by the SPGqp and SPGfp methods for the £i j00 -projections alone. The last two columns report the overall 
time taken by SPGqp and SPGfp. 

The results in Table |4] indicate that for large-scale problems, the savings accrued upon using our faster pro- 
jections (in combination with SPG) can be substantial. 

4.3.3 MTL results on real-world data 

We now show a running comparison between three methods: (i) SPGqp, (ii) SPGfp, and (iii) SGD (with projec- 
tion step computed using FP). For our comparison, we solve Mtl on a subset of the CMU Newsgroups datasefl 
The dataset corresponds to 5 feature selection tasks based on data taken from the following newsgroups: 
computer, politics, science, recreation, and religion. The feature selection tasks are spread over the matrices 
X%, . . . , X5, each of size 2907 x 53975, while the dependent variables yx,. . ■ ,ys correspond to class labels. 



SGD 

SPG-QP 

SPG-FP 




-SGD 

-SPG-QP 

-SPG-FP 



Running time (seconds) 



Figure 2: Running time results on CMU Newsgroups subset (left: less sparse; right: more sparse problem). 

Figure |2]reports running time results obtained by the three methods in question (all methods were initialized 
by the same Wo). As expected, the stochastic-gradient based method rapidly achieves a low-accuracy solution, 
but start slowing down as time proceeds, and eventually gets overtaken by the SPG based methods. Interestingly, 

7 Original at: http://www.es. cmu.edii/^textlearning/; we use the reduced version of 1201 . 
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in the first experiment, SPGqp takes much longer than SPGfp to convergence, while in the second experiment, it 
lags behind substantially before accelerating towards the end. We attribute this difference to the difficulty of the 
projection subproblem: in the beginning, the sparsity pattern has not yet emerged, which drives SPGqp to take 
more time. In general, however, from the figure it seems that either SGD or SPGfp yield an approximate solution 
more rapidly — so for problems of increasingly larger size, we might prefer them@ 

5 Discussion 

We described mixed-norms for vectors, which we then naturally extended also to matrices. We presented some 
duality theory, which enabled us to derive root-finding algorithms for efficiently computing projections onto 
mixed-norm balls, especially for the special class of l\ g -mixed norms. For solving an overall regression problem 
involving mixed-norms we suggested two main algorithms, spectral projected gradient and stochastic-gradient 
(for separable losses). We presented a small but indicative set of experiments to illustrate the computational 
benefits of our ideas, in particular for the multitask lasso problem. 

At this point, several directions of future work remain open — for instance: 

• Designing fast projection methods for certain classes of non-separable mixed norms. Some algorithms 
already exist for particular classes |fTll28l . 

• Studying norm projections with additional simple constraints (e.g., bounds). 

• Extending the fast methods of this paper to non-Euclidean proximity operators. 

• Exploring applications of matrix mixed-norm regularizes . 
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